Application of various machine learning techniques to predict obstructive sleep apnea syndrome severity

As the incidence of obstructive sleep apnea syndrome (OSAS) increases worldwide, there is a growing need for a new screening method that can compensate for the shortcomings of the traditional diagnostic method, polysomnography (PSG). In this study, data from 4014 patients were analyzed with both supervised and unsupervised learning methods. Clustering was conducted with hierarchical agglomerative clustering, K-means, bisecting K-means, and a Gaussian mixture model, and feature engineering was carried out using both medically researched methods and machine learning techniques. For classification, we used gradient boosting-based models (XGBoost, LightGBM, and CatBoost) and Random Forest to predict the severity of OSAS. The developed model showed high performance, with classification accuracies of 88%, 88%, and 91% for three severity thresholds: Apnea-Hypopnea Index (AHI) ≥ 5, AHI ≥ 15, and AHI ≥ 30, respectively. The results of this study provide evidence that machine learning has sufficient potential for predicting OSAS severity.

and clustering before classification 16 . In this study, experiments are conducted using a variety of methods, from techniques used in machine learning to methods suggested by medical studies. Accuracy is calculated by comparison with the AHI measured by actual PSG, and the calculated accuracy is used to compare the utility of the models according to the severity of OSAS.

Methods
Data acquisition and ethics declarations. The data used were collected from patients who visited the sleep clinic of Samsung Medical Center between 2014 and 2021. The data include personal information, such as gender, age, height, and weight, as well as physical measurements (abdominal circumference, neck circumference, hip circumference, etc.) and results of self-report questionnaires (Epworth Sleepiness Scale (ESS), Insomnia Severity Index (ISI), etc.). PSG was performed with an Embla N7000 (Medcare-Embla, Reykjavik, Iceland), and the results from the machine's automated scoring system were used to determine OSAS. AHI was measured as the number of episodes of apnea and hypopnea per hour. PSG features were also collected. The workflow of the predictive models is shown in Fig. 1.
For the software tools, the open-source programming language Python (version 3.9.9; Python Software Foundation, Delaware, USA) was used in all the processes of the study. SciPy 17 package (version 1.8.1) was mainly used for statistical analysis, and scikit-learn 18 library (version 1.1.2) was mainly used to develop the predictive models. The study protocol was approved by the institutional review board of Samsung Medical Center (IRB no. 2022-07-003), and the entire process of the study was performed in accordance with the ethical standards of the Declaration of Helsinki. The waiver of informed consent was approved by the institutional review board of Samsung Medical Center since this work is a retrospective study that only involves anonymous patient data.
Data pre-processing. The processed data consist of 4014 samples described by 33 numerical or categorical features. The main characteristics of the dataset are shown in Table 1. The OSAS severity of the dataset was classified into 4 classes corresponding to the severity levels defined by the American Academy of Sleep Medicine Task Force 19 . For the classification, 20% of the dataset was used as test data. Each classifier was trained with 5-fold cross-validation on the training dataset. Among the input features, numerical features were tested for normality using the Kolmogorov-Smirnov test. Normally distributed features were compared with Student's t-test, and the remaining features with the Mann-Whitney U test. For categorical features, the chi-square test was used. A p-value of less than 0.05 was considered significant.
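The labelling and splitting described above can be sketched as follows. This is a minimal, dependency-free illustration rather than the code used in the study: the class boundaries assume the usual four-level grading (normal, mild, moderate, severe) at the AHI cut-offs of 5, 15, and 30 used in this work, and the function names, the random seed, and the simple shuffled (non-stratified) split are assumptions.

```python
import random

# AHI cut-offs used in this study (events per hour).
CUTOFFS = (5, 15, 30)
# Assumed class names for the four severity levels.
SEVERITY = ("normal", "mild", "moderate", "severe")


def severity_class(ahi):
    """Map an AHI value to a severity class.

    The class index equals the number of cut-offs that the AHI reaches.
    """
    return SEVERITY[sum(ahi >= c for c in CUTOFFS)]


def binary_targets(ahi):
    """One binary label per cut-off: 1 if AHI >= cut-off, else 0."""
    return {c: int(ahi >= c) for c in CUTOFFS}


def train_test_indices(n, test_fraction=0.2, seed=0):
    """Shuffle sample indices and hold out a fraction as test data."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]


def k_folds(indices, k=5):
    """Split training indices into k near-equal validation folds."""
    return [indices[i::k] for i in range(k)]


train_idx, test_idx = train_test_indices(4014)
folds = k_folds(train_idx)
```

With 4014 samples, a 20% hold-out leaves 3211 training samples, which are then divided into five validation folds.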

Clustering. A combination of mutual information (MI) and a recursive feature elimination (RFE) 20 strategy on LightGBM was applied as the feature selection method for clustering. MI is a metric that indicates the interdependence between two variables, and RFE is a feature selection method that starts with all input features and removes less important features one by one as learning repeats. In the feature selection process, MI was computed to filter out less informative variables. The threshold for filtering was set to the mean of the MI scores. RFE was then applied to determine the final number of features for clustering. For clustering, hierarchical agglomerative clustering, K-means, bisecting K-means, and a Gaussian mixture model were used. The algorithms that determine the number of clusters automatically all produced a large number of clusters, which did not fit the purpose of our clustering. Therefore, clustering algorithms that require the number of clusters to be specified manually were used.
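The two-stage selection can be illustrated with a short sketch. It is not the study's implementation: MI is computed here for discrete variables only (continuous features would need binning or a nearest-neighbour estimator), the rule of keeping features at or above the mean score is an assumed reading of the threshold, and `importance` is a hypothetical callback standing in for feature importances from a LightGBM model refitted on the remaining features.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    # sum over observed pairs of p(a,b) * log(p(a,b) / (p(a) * p(b)))
    return sum(
        (c / n) * math.log(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )


def mi_filter(features, target):
    """Keep features whose MI with the target reaches the mean MI score."""
    scores = {name: mutual_information(col, target)
              for name, col in features.items()}
    threshold = sum(scores.values()) / len(scores)
    kept = [name for name, score in scores.items() if score >= threshold]
    return kept, scores


def recursive_elimination(names, importance, n_keep):
    """Drop the least important feature one at a time until n_keep remain.

    `importance(names)` is a hypothetical callback that refits a model on
    the given features and returns a {name: importance} mapping.
    """
    names = list(names)
    while len(names) > n_keep:
        scores = importance(names)
        names.remove(min(names, key=scores.__getitem__))
    return names
```

In the study, the number of features to keep was chosen by 5-fold cross-validation; in this sketch it is passed in as `n_keep`.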
Hierarchical clustering is a common clustering algorithm that builds nested clusters by successively merging or splitting them. Agglomerative clustering is the bottom-up approach to hierarchical clustering. Each point starts as its own cluster, and the most similar clusters are successively merged during the clustering process.
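The bottom-up merging can be sketched in a few lines. The linkage criterion is not stated in this work, so average linkage with Euclidean distance is used here purely as an assumption, and the function name is illustrative.

```python
import math


def agglomerative(points, n_clusters):
    """Bottom-up clustering: merge the closest pair until n_clusters remain."""
    # every point starts as its own cluster
    clusters = [[p] for p in points]

    def linkage(a, b):
        # average linkage: mean pairwise Euclidean distance (assumption)
        return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # find the pair of clusters with the smallest linkage distance
        i, j = min(
            ((i, j)
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters
```

This exhaustive search over cluster pairs is only suitable for small examples; it is meant to show the merging rule, not to be efficient.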
K-means is the most popular clustering algorithm 21 and is known for its simplicity. To find K clusters, K points are first selected as the initial centroids. Then, all points are assigned to the nearest centroid and the centroid of each cluster is recomputed. These steps are repeated until the centroids remain unchanged. Bisecting K-means is a variant of the K-means algorithm 22 . It uses the basic K-means algorithm to split a cluster into 2 sub-clusters (the bisecting step), repeats the bisecting step several times, and takes the split that produces the clustering with the highest overall similarity.
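Both procedures can be sketched as follows. This is an illustrative, dependency-free version under stated assumptions: initial centroids are sampled at random from the data, "highest overall similarity" is interpreted as the lowest within-cluster sum of squared errors (SSE), and the cluster with the largest SSE is the one chosen for the next bisection. The input is assumed to contain at least k distinct points.

```python
import math
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Basic K-means: returns (centroids, groups of points)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # K data points as initial centroids
    for _ in range(max_iter):
        # assignment step: each point joins its nearest centroid
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            groups[nearest].append(p)
        # update step: recompute centroids (keep the old one if a group is empty)
        new_centroids = [
            tuple(sum(xs) / len(g) for xs in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
        if new_centroids == centroids:  # centroids unchanged: converged
            break
        centroids = new_centroids
    return centroids, groups


def sse(group):
    """Sum of squared distances of a group's points to its centroid."""
    center = tuple(sum(xs) / len(group) for xs in zip(*group))
    return sum(math.dist(p, center) ** 2 for p in group)


def bisecting_kmeans(points, k, n_trials=5):
    """Repeatedly bisect the cluster with the largest SSE until k clusters exist."""
    clusters = [list(points)]
    while len(clusters) < k:
        target = max(clusters, key=sse)
        clusters.remove(target)
        # try several bisections and keep the one with the lowest total SSE
        best = min(
            (kmeans(target, 2, seed=s)[1] for s in range(n_trials)),
            key=lambda halves: sum(sse(h) for h in halves),
        )
        clusters.extend(best)
    return clusters
```

The `n_trials` parameter corresponds to repeating the bisecting step and keeping the best split; its value here is arbitrary.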
A Gaussian mixture model (GMM) is a probabilistic model that assumes that the data of every subgroup follow a Gaussian distribution 23 .
Feature engineering. Both methods proposed in medical research and methods widely used in machine learning were applied as feature engineering techniques. Weighted ESS and a formula for predicting AHI were used as the medical approach, and body proportion data were also added by processing the body measurement data in the dataset. Weighted ESS assigns a different weight to each question of the ESS. A recent study has shown that weighted ESS is better at predicting the severity of OSAS than the general ESS 24 . Since our dataset includes the response to each ESS item, weighted ESS could be applied.
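A minimal sketch of these engineered features is given below. The item weights of the weighted ESS are not reproduced in this text, so the weights are left as an input and any values passed in are placeholders; the specific body ratios (waist-to-hip, waist-to-height, neck-to-height) are likewise hypothetical examples of body proportion features, not the ones confirmed by the study.

```python
def ess_total(responses):
    """Standard ESS: the sum of eight item scores, each between 0 and 3."""
    if len(responses) != 8 or any(r not in (0, 1, 2, 3) for r in responses):
        raise ValueError("ESS needs eight item scores between 0 and 3")
    return sum(responses)


def weighted_ess(responses, weights):
    """Weighted ESS: each item score is multiplied by its own weight.

    The weights are placeholders supplied by the caller.
    """
    if len(weights) != len(responses):
        raise ValueError("one weight per ESS item is required")
    return sum(w * r for w, r in zip(weights, responses))


def body_ratios(abdominal_cm, hip_cm, neck_cm, height_cm):
    """Hypothetical body proportion features derived from raw measurements."""
    return {
        "waist_to_hip": abdominal_cm / hip_cm,
        "waist_to_height": abdominal_cm / height_cm,
        "neck_to_height": neck_cm / height_cm,
    }
```

With all weights set to 1, `weighted_ess` reduces to the standard ESS total, which makes the relationship between the two scores explicit.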
We also used a predictive mathematical formula for AHI proposed in a previous study 25 . We modified its constants using the SciPy package to optimize the formula for our dataset. Since the dataset contains two measurements of neck circumference (NC), taken in the sitting and lying positions, the formula was optimized for each of these measurements. In addition, three different criteria were used for determining excessive daytime sleepiness (EDS): the criteria for weighted ESS, the criteria from the American Academy of Sleep Medicine Task Force, and the criteria from the study that proposed the predictive formula.
Predictive models. Gradient boosting-based models and random forest are considered among the most effective machine learning models for dealing with large amounts of complex data. These algorithms have been shown to be not only accurate but also efficient 26,27 . Therefore, in this work, we used random forest and three different models based on gradient boosting (XGBoost, LightGBM, and CatBoost) to enhance classification performance efficiently.
Random forest is a classifier consisting of a combination of decision trees built on random sub-samples of the dataset 28 . Since the classifier is composed of decorrelated decision trees, it is resistant to noise and overfitting.
XGBoost is a gradient boosting-based decision tree ensemble designed to be highly efficient and scalable 29 . Since the model performs parallel computation automatically, it is faster than the general gradient boosting framework. XGBoost also lowers the risk of overfitting by applying different regularization penalties.
LightGBM is a gradient boosting framework designed to be fast and highly efficient 30 . When the data are high-dimensional and large, traditional gradient boosting-based models require scanning all the data instances for each feature to estimate the information gain of all the possible segmentation points, which is excessively time-consuming and inefficient. LightGBM uses Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB) to deal with this problem. With those techniques, LightGBM reduces the number of samples and the number of features in the dataset.
CatBoost is a gradient boosting algorithm on decision trees that introduces two techniques: an innovative method for processing categorical features, and a permutation-driven variant of gradient boosting 31 . Both methods were created to resist the prediction shift caused by target leakage, which is present in other implementations of gradient boosting algorithms.
Hyperparameter optimization is one of the most cumbersome parts of a machine learning project. Therefore, diverse optimization techniques are used to simplify the procedure. In this work, we selected Bayesian optimization, which is one of the most commonly used methods for hyperparameter tuning. The hyperparameters to be optimized were selected considering the characteristics of both the dataset and the classifier model. The selected hyperparameters of each model were optimized with a technique based on Bayesian optimization using Optuna 32 .

Results
Clustering results. Various feature scaling methods were applied to the numerical features of the dataset, and MI-LightGBM-RFE was used for feature selection. First, MI scores according to the AHI cut-off values were computed for all input features to filter out less informative variables. The computed MI scores are shown in Fig. 2. After this process, less important features were eliminated with the LightGBM-RFE method. The number of features was determined by 5-fold cross-validation. The cross-validation result of LightGBM-RFE is shown in Fig. 3. Hip circumference, head circumference, age, neck circumference (sitting position), weight, BMI, and abdominal circumference were selected as features for the mild OSAS (AHI ≥ 5) clustering. For moderate OSAS (AHI ≥ 15), age, abdominal circumference, PSQI total score, BMI, weight, hip circumference, SSS total score, head circumference, and height were selected. For severe OSAS (AHI ≥ 30), sex, hours of sleep, abdominal circumference, weight, hip circumference, SSS total score, head circumference, and height were selected.
All of the selected clustering algorithms were applied to the datasets of scaled and selected features. The clustering results with the best classification accuracy on the test dataset were selected for the final prediction models. Among the selected clustering algorithms, hierarchical agglomerative clustering recorded the best classification accuracy when the AHI cut-off value was 5. GMM exhibited the highest classification accuracy for moderate OSAS (AHI ≥ 15). For severe OSAS (AHI ≥ 30), K-means showed the best performance. The number of clusters was determined using the elbow method based on the silhouette score, and it was determined to be 2 for all AHI cut-off values.
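The silhouette score used to choose the number of clusters can be sketched as follows. This is a plain-Python illustration with Euclidean distance, not the study's code; the toy points and candidate labellings are invented for the example, and the candidate with the highest mean silhouette is simply taken as the preferred one.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (needs >= 2 clusters)."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)

    total = 0.0
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # the silhouette of a singleton cluster is defined as 0
        # a: mean distance to the other points of the same cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the points of any other cluster
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for key, other in clusters.items() if key != label
        )
        total += (b - a) / max(a, b)
    return total / len(points)


# Toy data (invented): two well-separated groups of three points.
toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
candidates = {2: [0, 0, 0, 1, 1, 1], 3: [0, 0, 0, 1, 1, 2]}
scores = {k: silhouette_score(toy, lab) for k, lab in candidates.items()}
best_k = max(scores, key=scores.get)
```

For this toy data the two-cluster labelling scores higher than the three-cluster one, mirroring the kind of comparison that led to 2 clusters in the study.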
Classification results by machine learning models and feature engineering methods. In the classification accuracy analysis, CatBoost performed best for mild OSAS with 87.52%. LightGBM performed best for moderate and severe OSAS, achieving 86.01% and 91.11%, respectively. Figure 4 shows the classification accuracy according to classification algorithms. Overall, LightGBM showed the best performance across the severity classes. In contrast, random forest showed the lowest performance in all severity classes, with significant differences from the other machine learning models. In the feature engineering procedure, we applied diverse methods to the dataset, and all of them were trained and evaluated. For mild OSAS, applying AHI prediction with neck circumference in the lying position, alone or together with the body measurement ratio, showed the best accuracy of 87.48%. For moderate OSAS, applying weighted ESS, alone or together with the body measurement ratio, showed the best accuracy of 84.41%. For severe OSAS, the best-performing feature engineering methods were similar to those for mild OSAS, and the best accuracy was 88.13%. Figure 5 shows the classification accuracy according to feature engineering methods.
Classification results by approaches to building prediction models. The prediction results with clustering outperformed the prediction results without clustering. The report of classification metrics is presented in Table 2. Statistical significance was tested using the Mann-Whitney U test (significance level 0.05). The improvement obtained by using clustering to build a classification model was statistically significant for mild and moderate OSAS classifications, whereas it was not for severe OSAS classification.
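The comparison of accuracies with and without clustering can be illustrated with a small sketch of the two-sided Mann-Whitney U test. It uses the normal approximation without tie or continuity correction, which is a simplification; SciPy provides the test as `scipy.stats.mannwhitneyu`. The accuracy values below are invented for the example and are not results from this study.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation).

    Returns the U statistic of x and an approximate p-value.
    """
    n1, n2 = len(x), len(y)
    # U counts the pairs in which x exceeds y; ties count one half
    u1 = sum((a > b) + 0.5 * (a == b) for a in x for b in y)
    u = min(u1, n1 * n2 - u1)
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u1, p


# Hypothetical accuracies from repeated runs (illustrative only).
with_clustering = [0.871, 0.882, 0.879, 0.885, 0.876]
without_clustering = [0.842, 0.851, 0.848, 0.839, 0.845]
u_stat, p_value = mann_whitney_u(with_clustering, without_clustering)
significant = p_value < 0.05
```

For samples this small an exact test would normally be preferred; the approximation is used here only to keep the sketch short.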
In terms of classification accuracy, the approach of clustering with feature engineering and hyperparameter tuning performed best in moderate and severe OSAS predictions, with 87.84% and 91.06%, respectively. However, clustering with feature engineering alone showed the highest accuracy, 88.16%, when predicting mild OSAS.
ROC curves according to the severity classes of OSAS and the approaches to building the predictive models are visualized in Fig. 6. In line with the results of the accuracy analysis, the best AUC value was observed when predicting after clustering with feature engineering and hyperparameter tuning in moderate and severe OSAS predictions. For predicting mild OSAS, clustering with feature engineering was the best.
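For reference, the metrics reported for each binary cut-off (accuracy, precision, recall, f1, and AUC) can be computed as in the sketch below. The function names are illustrative, and the AUC is obtained from its rank interpretation, that is, the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and f1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true, scores):
    """AUC as the fraction of positive-negative pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # ties between a positive and a negative score count one half
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUC computation is quadratic in the number of samples and is intended for illustration rather than for large datasets.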

Discussion
In this study, predictive models for the severity of OSAS were developed by applying various machine learning methodologies. The applicability of the models was tested and analyzed according to severity. Using MI-LightGBM-RFE, we identified the important features for clustering at each AHI cut-off value. We also found that hierarchical agglomerative clustering, GMM, and K-means clustering were effective for the AHI cut-off values of 5, 15, and 30, respectively.
The gold standard for diagnosing OSAS is PSG. However, PSG has the disadvantages of being laborious, time-consuming, and expensive. Therefore, many studies have sought to develop methods for screening OSAS without performing PSG, and machine learning techniques have been widely applied 33-37 . In recent years, research on the South Korean population has also been actively conducted. However, these studies were limited in that the experiments were conducted on a limited population and focused only on supervised learning 38,39 .
To the best of our knowledge, this work achieves the best performance among studies predicting OSAS severity in the South Korean population using machine learning techniques. Compared to previous studies, this study is significant not only in terms of its results but also in terms of its research process. In this work, we suggested a new methodology that uses both supervised and unsupervised learning algorithms to predict the severity of OSAS. Moreover, to date our experiment covers the largest South Korean population in research on predicting OSAS severity with machine learning algorithms.
Despite the appreciable prediction performance, this study has several limitations. Since the data were collected from only one sleep clinic, the results are difficult to generalize to the populations of other sleep centers. In addition, a considerable number of values were missing in the provided data because this work is a retrospective study.
OSAS is a major worldwide public health concern with an increasing prevalence. Therefore, there is a need for OSAS severity prediction models that can be used in clinical settings. Our work provides a basis for confirming that machine learning has sufficient potential for OSAS severity prediction, and also suggests that outcome prediction models may be useful for prioritizing which patients are assigned to PSG.

Table 2. The report of classification metrics of predictive models by approaches. Data are reported as mean (standard deviation) and [score range]. * p < 0.05 was statistically significant. ** Accuracies of the results were statistically tested and the classification results without clustering were used as the baseline for the statistical test (Mann-Whitney U test). Columns: predictive model building approach, AHI cut-off value, accuracy (%), AUC (%), f1 (%), precision (%), recall (%), p-value.

Conclusion
In this study, we predicted the severity of OSAS from only simple information, such as gender, age, body measurements, and questionnaire responses, using diverse machine learning techniques. Compared to a general application of supervised learning alone, the approach combining supervised and unsupervised learning showed significantly better performance in OSAS severity prediction. The results of this work demonstrate the applicability of machine learning methods to OSAS screening. Due to the retrospective nature of the study, a considerable amount of data was unavailable for reasons such as missing values, and the data were collected from a single institution, which may introduce bias. Future work could use data from a larger population at various institutions to improve upon this study. In conclusion, the predictive model presented in this study provides an accurate estimate of the severity class of OSAS, which is important evidence that OSAS can be effectively screened without time-consuming and labor-intensive tests.

Data availability
The data that support the findings of this study are available from NYX corporation but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are available from the authors upon reasonable request and with permission of NYX corporation.